Walking on Minimax Paths for k-NN Search
Authors
Abstract
Link-based dissimilarity measures, such as shortest path distance or Euclidean commute time distance, derive the distance between two points from paths between the corresponding nodes of a weighted graph. Compared to Euclidean distance, these measures are better suited to data manifolds with nonconvex-shaped clusters, so k-nearest neighbor (k-NN) search is improved in such metric spaces. In this paper we present a new link-based dissimilarity measure based on minimax paths between nodes. Minimax path-based dissimilarity has two main benefits: (1) it is scalable because only a subset of paths is considered, whereas Euclidean commute time distance considers all possible paths; (2) it captures nonconvex-shaped cluster structure better than shortest path distance. We define the total cost assigned to a path between nodes as the Lp norm of the intermediate costs of the edges along the path, and show that the minimax path emerges from this Lp-norm-over-paths framework. We also define the minimax distance as the intermediate cost of the longest edge on the minimax path, and present a greedy algorithm that computes the k smallest minimax distances between a query and N data points in O(log N + k log k) time. Numerical experiments demonstrate that our minimax k-NN algorithm reduces the search time by several orders of magnitude compared to existing methods, while the quality of k-NN search is significantly improved over Euclidean distance.
Introduction
Given a set of N data points X = {x1, . . . , xN}, k-nearest neighbor (k-NN) search in metric spaces involves finding the k closest points in the dataset X to a query xq. A dissimilarity measure defines the distance duv between two data points (or nodes of a weighted graph) xu and xv in the corresponding metric space, and the performance of k-NN search depends on the distance metric. Euclidean distance ‖xu − xv‖2 is the most popular measure for k-NN search, but it does not work well when the data points X lie on a curved manifold with nonconvex-shaped clusters (see Fig. 1(a)).
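As a point of reference for the link-based measures discussed below, the Euclidean baseline can be sketched as a brute-force k-NN query (a minimal illustration only, not the paper's method; the function name and toy data are my own):

```python
import numpy as np

def euclidean_knn(X, query, k):
    """Brute-force k-NN under Euclidean distance ||x_u - x_v||_2."""
    d = np.linalg.norm(X - query, axis=1)   # distance to every point
    idx = np.argsort(d)[:k]                 # indices of the k closest
    return idx, d[idx]

# toy data: three points near the origin, one far away
X = np.array([[0., 0.], [1., 0.], [0., 1.], [5., 5.]])
idx, dist = euclidean_knn(X, np.array([0.2, 0.1]), k=2)
```

This works well for convex, well-separated clusters, but ranks points purely by straight-line distance, ignoring any manifold structure.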
Copyright © 2013, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Metric learning (Xing et al. 2003; Goldberger et al. 2005; Weinberger and Saul 2009) optimizes the parameters of the Mahalanobis distance using a labeled dataset, such that points in the same cluster come close together and points in different clusters move far apart. Most metric learning methods are limited to linear embeddings, so nonconvex-shaped cluster structure is not well captured (see Fig. 1(b)). Link-based (dis)similarity measures (Fouss et al. 2007; Yen, Mantrach, and Shimbo 2008; Yen et al. 2009; Mantrach et al. 2010; Chebotarev 2011) rely on paths between nodes of a weighted graph, where nodes are associated with data points and intermediate costs (for instance, Euclidean distances) are assigned to edge weights. The distance between two nodes depends on the total cost, computed by aggregating the edge weights on a path connecting the nodes of interest. The total cost associated with a path is often assumed to be additive (Yen, Mantrach, and Shimbo 2008), so the aggregation reduces to summation. The dissimilarity between two nodes is then calculated by integrating the total costs assigned to all possible paths between them. Such integration is often carried out by computing the pseudo-inverse of the graph Laplacian, leading to the Euclidean commute time distance (ECTD), the regularized Laplacian kernel, and the Markov diffusion kernel (see (Fouss et al. 2007; Yen, Mantrach, and Shimbo 2008; Fouss et al. 2012) and references therein), which are known to better capture nonconvex-shaped cluster structure (see Fig. 1(c)). However, all possible paths between nodes must be considered to compute these distances, and inverting the N × N matrix requires O(N^3) time and O(N^2) space, so this approach does not scale well to the problems of interest.
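To make the scalability issue concrete, here is a minimal sketch of average commute time distances computed from the Laplacian pseudo-inverse (my own illustration, not code from the paper; the function name and toy graph are assumptions). The `np.linalg.pinv` call is the O(N^3) step the text refers to:

```python
import numpy as np

def commute_time_distances(A):
    """Average commute time distances for a symmetric adjacency matrix A,
    via the Moore-Penrose pseudo-inverse of the graph Laplacian."""
    d = A.sum(axis=1)
    L = np.diag(d) - A                  # graph Laplacian L = D - A
    Lp = np.linalg.pinv(L)              # O(N^3) time, O(N^2) space
    vol = d.sum()                       # graph volume: sum_ij a_ij
    diag = np.diag(Lp)
    # n(i,j) = vol * (l+_ii + l+_jj - 2 l+_ij); ECTD is its square root
    return vol * (diag[:, None] + diag[None, :] - 2 * Lp)

# toy example: a 3-node path graph with unit edge weights
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
CT = commute_time_distances(A)
```

For this path graph the commute time between adjacent nodes is 4 and between the endpoints is 8, consistent with the effective-resistance interpretation of ECTD.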
Shortest path distance (Dijkstra 1959) is another popular link-based dissimilarity measure (Tenenbaum, de Silva, and Langford 2000), in which only shortest paths are considered when computing the distance between two nodes. The computational cost is reduced, but because the distance is computed along the shortest path only, the cluster structure is not well captured when shortest path distance is used for k-NN search (see Fig. 1(d)). Randomized shortest path (RSP) dissimilarities were proposed as a family of distance measures depending on a single parameter, with the interesting property of reducing to the standard shortest path distance when the parameter is large and to the commute time distance when the parameter approaches zero (Yen, Mantrach, and Shimbo 2008). In this paper we present a new link-based k-NN search method built on minimax paths (Pollack 1960; Gower and Ross 1969).
Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence
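For contrast with additive shortest path costs, the minimax distance described in the abstract can be illustrated with a Dijkstra-style sweep that aggregates edge weights by max instead of sum (a straightforward sketch, not the paper's O(log N + k log k) greedy algorithm; the toy graph is hypothetical):

```python
import heapq

def minimax_distances(graph, source):
    """Minimax (bottleneck) distance from `source` to every node: the
    smallest achievable value, over all connecting paths, of the largest
    edge weight on the path. Replacing max with + below would recover
    classical Dijkstra shortest path distances."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                      # stale heap entry
        for v, w in graph[u]:
            nd = max(d, w)                # path cost = max edge weight
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

# toy graph: two tight clusters {a,b,c} and {d,e} joined by one long edge
graph = {
    "a": [("b", 1.0), ("c", 1.5)],
    "b": [("a", 1.0), ("c", 1.2)],
    "c": [("a", 1.5), ("b", 1.2), ("d", 9.0)],
    "d": [("c", 9.0), ("e", 1.1)],
    "e": [("d", 1.1)],
}
mm = minimax_distances(graph, "a")
```

Every node in the other cluster sits at minimax distance 9.0 (the bottleneck edge), so points within a cluster are uniformly closer than any point outside it, which is the cluster-capturing behavior the paper exploits.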
Similar papers
Improvement of the Effective Components in the PDR Positioning Method Based on Detecting the User’s Movement Mode Using Smartphone Sensors
The purpose of this paper is to evaluate and improve the accuracy of indoor positioning using smartphone sensors based on the Pedestrian Dead Reckoning (PDR) method. In some specific situations, such as fires or power outages that disable infrastructure-based positioning techniques, the PDR method, which performs positioning continuously using smartphone sensors, is a good solution. This paper focu...
Optimal rates for k-NN density and mode estimation
We present two related contributions of independent interest: (1) high-probability finite sample rates for k-NN density estimation, and (2) practical mode estimators – based on k-NN – which attain minimax-optimal rates under surprisingly general distributional conditions.
A minimax search algorithm for CDHMM based robust continuous speech recognition
In this paper, we propose a novel implementation of a minimax decision rule for continuous density hidden Markov model based robust speech recognition. By combining the idea of the minimax decision rule with a normal Viterbi search, we derive a recursive minimax search algorithm, where the minimax decision rule is repetitively applied to determine the partial paths during the search procedure. ...
Analysis of k-Nearest Neighbor Distances with Application to Entropy Estimation
Estimating entropy and mutual information consistently is important for many machine learning applications. The Kozachenko-Leonenko (KL) estimator (Kozachenko & Leonenko, 1987) is a widely used nonparametric estimator for the entropy of multivariate continuous random variables, as well as the basis of the mutual information estimator of Kraskov et al. (2004), perhaps the most widely used estima...
k-Nearest Neighbors in Uncertain Graphs
Complex networks, such as biological, social, and communication networks, often entail uncertainty, and thus, can be modeled as probabilistic graphs. Similar to the problem of similarity search in standard graphs, a fundamental problem for probabilistic graphs is to efficiently answer k-nearest neighbor queries (k-NN), which is the problem of computing the k closest nodes to some specific node....